Calibrated Ensemble Forecasts Using Quantile Regression Forests and Ensemble Model Output Statistics

Ensembles used for probabilistic weather forecasting tend to be biased and underdispersive. This paper proposes a statistical method for postprocessing ensembles based on quantile regression forests (QRF), a generalization of random forests for quantile regression. This method does not ﬁt a parametric probability density function (PDF) like in ensemble model output statistics (EMOS) but provides an estimation of desired quantiles. This is a nonparametric approach that eliminates any assumption on the variable subject to calibration. This method can estimate quantiles using not only members of the ensemble but any predictor available including statistics on other variables. The method is applied to the M é t é o-France 35-member ensemble forecast (PEARP) for surface temperature and wind speed for available lead times from 3 up to 54h and compared to EMOS. All postprocessed ensembles are much better calibrated than the PEARP raw ensemble and experiments on real data also show that QRF performs better than EMOS, and can bring a real gain for human forecasters compared to EMOS. QRF provides sharp and reliable probabilistic forecasts. At last, classical scoring rules to verify predictive forecasts are completed by the introduction of entropy as a general measure of reliability.


Introduction
In recent years, meteorologists have seen the rise of ensemble forecasting in numerical weather prediction and its development in national meteorological services. Ensemble forecasting is clearly a necessary tool that complements deterministic forecast. Ensemble forecasts seek to represent and quantify different uncertainty sources in the forecast: observation errors or a mathematical representation of the atmosphere still incomplete. In practice ensemble forecasts tend to be biased and underdispersed (Hamill and Colucci 1997;Hamill and Whitaker 2006).
Several techniques for the statistical postprocessing of ensemble model output have been developed to square up to these shortcomings. Local quantile regression and probit regression were used for probabilistic forecasts of precipitation by Bremnes (2004). Other techniques of regression like censored quantile regression have been applied to extreme precipitation (Friederichs and Hense 2007) and logistic regression was employed for probabilistic forecasts of precipitation Wilks 2009;Ben Bouallègue 2013). Two approaches are baseline in postprocessing techniques: the Bayesian model averaging (BMA; Raftery et al. 2005) and the ensemble model output statistics (EMOS; Gneiting et al. 2005). Whereas the BMA predictive distribution is a mixture of PDF depending on the variable to calibrate, the EMOS technique fits a single PDF from a raw ensemble. All parameters of theses PDFs are generally fitted on a sliding training period. In meteorology, BMA has been studied for many variables such as surface temperature , quantitative precipitation (Sloughter et al. 2007), surface wind speed (Sloughter et al. 2010), or surface wind direction (Bao et al. 2010). Meanwhile EMOS techniques have been used for surface temperature Hagedorn et al. 2008), quantitative precipitation , surface wind speed (Thorarinsdottir and Gneiting 2010; Baran and Lerch 2015), wind vectors (Pinson 2012;Schuhen et al. 2012), or peak wind (Friederichs and Thorarinsdottir 2012). More recently, Hemri et al. (2014) have applied EMOS to many variables.
In this paper we define a new nonparametric postprocessing method based on quantile regression forests (QRF) developed by Meinshausen (2006). Our QRF method will be compared to EMOS, which is efficient and simple to implement in an operational context by national meteorological services. The QRF technique has already been used by Juban et al. (2007) for wind energy and by Zamo et al. (2014b) for photovoltaic electricity production.
The paper is organized as follows: in section 2 we describe the QRF technique in detail and we do a quick review of the EMOS technique. We explain how we verify ensemble forecasts. Guided by Gneiting et al. (2007) we apply tools like rank histograms and indices to quantify their behavior, in particular we introduce entropy for verification of reliability. Scoring rules like the continuous ranked probability score (CRPS) is also presented to assess both reliability and sharpness. Section 3 presents a case study comparing postprocessing techniques for surface temperature and surface wind speed over 87 French locations at 18 lead times using observations and the French ensemble forecast system of Météo-France called PEARP (Descamps et al. 2015). Data comprise 4 years between 1 January 2011 and 31 December 2014 using initializations at 1800 UTC. Section 4 shows general results of postprocessing techniques for studied variables. The QRF forecast and more particularly QRF forecasts based on multivariable predictors are better calibrated than EMOS forecasts and bring a real gain in comparison to this technique. The paper closes with a discussion in section 5.

a. Quantile regression forests
For a calibration purpose the QRF method can be linked with the method of analogs (Hamill and Whitaker 2006;Delle Monache et al. 2013): its goal is to aggregate meteorological situations according to their forecasts, assuming that close forecasts lead to close observations. So, our QRF method aggregates observations according to their forecasts by iterative binary splitting on predictors. At the end we have for every meteorological situation restored a group of observations that creates an empirical cumulative distribution function (CDF). This method requires a large learning sample but has the advantages of being nonlinear and to potentially use others predictors than only the raw ensemble forecast.
We now describe the QRF method and explain the different means used to verify our ensemble forecasts. Let us remember that a quantile of order a is a value x a such that the probability that the random variable will be less than x a is a. Thus, a is the value of the CDF for x a : (1) While classical regression techniques allow us to estimate the conditional mean of a response variable, quantile regression allows us to estimate the conditional median or any other quantile of the response variable given a set of predictors (Koenker and Bassett 1978). Quantile regression such as QRF consists in building random forests from binary decision trees called classification and regression trees (CART), which are presented below. This is a nonlinear approach.

1) DECISION TREES (CART)
This technique (Breiman et al. 1984) consists in building binary decision trees whose interpretation is very easy. Zamo et al. (2014a) explain this technique in detail. The binary decision tree method consists in an iterative split of the data into two groups. This split is done according to some threshold of one of the predictors for quantitative predictors or according to some groups of modalities for qualitative predictors. The predictor and the threshold or grouping are chosen in order to maximize the homogeneity of the corresponding values of the response variable in each of the resulting groups. Homogeneity is defined as the sum of variances of the response variable within each groups: let D 0 be a group to split and D 1 and D 2 the two resulting groups. The variance of a group is (2) With t the threshold or grouping for a predictor in the predictors' space E , we define the homogeneity as Hðt, D 0 Þ 5 yðD 0 Þ 2 ½yðD 1 Þ 1 yðD 2 Þ $ 0: And we choose t such as Each resulting group is itself split into two, and so on until some stropping criterion is reached, which can be a minimum number of data or an insufficient decrease in the resulting groups' variance. Finally, for each final group (called leaves), the predicted value is the mean of observed values of the variable response belonging to the leaf. To avoid overfitting, binary trees are pruned at the splitting level that minimizes the squared error loss function estimated by cross validation. When one is faced with a new prediction situation, one follows the path in the tree with the value of the situation's predictors until a final leaf is reached. The forecast value is the mean of the predictand's values grouped in this leaf. Binary regression trees are easily interpretable because they can be represented by a decision tree, each node being the criterion used to split the data and each final leaf giving the predicted value. The interested reader can refer to Hastie et al. (2009, 305-312, 587-602) for detailed explanations.

2) BOOTSTRAP AGGREGATING (BAGGING)
According to the previous scheme, a tree can be a very unstable model (i.e., very dependent on the learning sample used for estimation). Breiman (1996) proposed to grow several trees and to average their predicted values to yield a more stable final prediction. This would require a lot of data in order to build enough independent trees. Since such a big amount of data is usually not available, bootstrap samples are usually used to build the trees. This means that artificial samples of data are simulated by randomly drawing with replacement among the original data. The complexity of the model is tuned with the number of bagged trees, and each individual tree is not pruned. The principle of bagging can be applied to other regression methods than binary trees.

3) RANDOM FORESTS
Since the binary trees used in bagging are built from the same data, they are not statistically independent and the variance of their mean cannot be indefinitely decreased. To make the bagged trees more independent, Breiman (2001) proposed to add another randomization step to bagging. Each split of each bagged tree is built on a random subset of the predictors. Hence, this method is called random forest. As in bagging, the overfitting problem is solved by tuning the number of trees.

4) QUANTILE REGRESSION FORESTS
Quantile regression forests (Meinshausen 2006) are a generalization of random forests and give a robust, nonlinear, and nonparametric way of estimating conditional quantiles. Whereas random forests approximate the conditional mean, quantile regression forests deliver an approximation of the full conditional distribution. In the same way as random forests, a quantile regression forest is a set of binary regression trees. But for each final leaf of each tree, one does not compute the mean of the predictand's values but instead their empirical CDF. Once the random forest is built, one determines for a new vector of predictors its associated leaf in each tree by following the binary splitting. Then the final forecast is the CDF computed by averaging the CDF from all the trees. Thus, predictive quantiles are directly obtained from the CDF. By construction, the final CDF is bounded between the lowest and the highest value of the learning sample. For example, it is not possible to forecast a negative quantile of wind speed and QRF is unable to forecast a quantile higher than the maximum measured in the training sample.

5) MODEL FITTING
The QRF method is used with different inputs here. The first, called QRT_O, uses as predictors only those statistics on the variable. The second, called QRT_M, contains not only statistics on the variable to calibrate but also on other meteorological variables issued from the ensemble: this is a multivariable approach. The lists of predictors are given in appendix A. For these variants, one must fit the number of trees and the size of the leaves. For temperature, the final leaf size is set to 10 and the number of tree is set to 300, which is a good compromise between quality and computation speed. For wind speed, the final leaf size is 20 and the number of trees is set to 400. Note that these parameters are set empirically by means of cross validation (not shown here).

b. Ensemble model output statistics
A description of the EMOS technique is given in Gneiting and Katzfuss (2014). The EMOS predictive distribution is a single parametric PDF whose parameters depend on the ensemble values. For example, it could be a normal density, where the mean is a bias corrected affine function of the ensemble members and the variance is a dispersion-corrected affine function of the ensemble variance.

MODEL FITTING
The EMOS technique was used considering the highresolution forecast called ARPEGE (Courtier et al. 1991), with the control member of the raw ensemble and the mean of the raw ensemble as predictors as in Hemri et al. (2014). The parameter vector is estimated by means of a CRPS minimization over the moving training period. Following Scheuerer (2014) we use as the initialization vector for a day the vector issued from the optimization at the precedent day. The optimization process is stopped after a few iterations to avoid overfitting.
For surface temperature, distributions tried in EMOS are the normal distribution and the logistic distribution. We finally keep the normal distribution, which is classical for temperatures. For wind speed, distributions tested are the truncated normal, gamma, truncated logistic, and square root-transformed truncated normal following Hemri et al. (2014). This last model performs best and is kept throughout the study. The correct formula for the corresponding CRPS is given in appendix B and we use it for our study.
c. Assessing sharpness and calibration Gneiting et al. (2007) propose to evaluate predictive performance based on the paradigm of maximizing the sharpness of the predictive distributions subject to calibration. Calibration refers to the statistical consistency between forecasts and observations. Also called reliability this is a joint property of predictions and events that materialize. Sharpness refers to the spread of predictive distributions and is a property of the forecasts only. For example, a climatological forecast would be reliable, but would have a poor sharpness.

1) SHARPNESS
To assess sharpness, we use summaries of the width of prediction intervals as in Gneiting et al. (2007). For example, we can introduce the average width of the central 50% prediction interval, the 90% prediction interval, or both. In this study we check the width of the central 50% prediction interval only, we denote it as interquartile range (IQR) in the following results.
2) THE RANK HISTOGRAM AND THE PIT HISTOGRAM Rank histograms (RH), also called Talagrand diagrams were developed independently by Anderson (1996), Talagrand et al. (1997), and Hamill and Colucci (1997). We employ RH to check the reliability of an ensemble forecast or a set of quantiles. An RH is built by ranking observations according to associated forecasts. Reliability implies that each rank should be filled with the same probability. Calibrated ensemble prediction systems should result in a flat RH. The opposite is not true: a flat RH may not refer to a calibrated system (Hamill 2001). In a general way, a U-shaped histogram refers to underdispersion or conditional bias, a domeshaped generally refers to overdispersion, while a nonsymmetric histogram refers to bias. A PIT histogram is the continuous version of the RH and permits to check reliability between observations and a predictive distribution by calculating Z 0 5 F(Y), where Y is the observation and F is the CDF of the associated predictive distribution. Subject to calibration, the random variable Z 0 has a standard uniform distribution (Gneiting and Katzfuss 2014) and we can check ensemble bias by comparing E(Z 0 ) to 1/2 and ensemble dispersion by comparing the variance var(Z 0 ) to 1/12. We apply this approach to a RH with K 1 1 ranks using the discrete random variable Z 5 [rank(y) 2 1]/K. Subject to calibration, Z has a discrete standard uniform distribution with E(Z) 5 1/2 and a normalized variance of V(Z) 5 12[K/(K 1 2)] var(Z) 5 1.
Moreover, Delle Monache et al. (2006) introduce the reliability or discrepancy index for a RH with K 1 1 ranks: where f i is the frequency of observations in the ith rank. We can complete this tool by checking k«k 2 (quadratic index) or k«k ' (max index), which are more sensitive to bigger errors than D.
Another tool that we will use to assess calibration is the entropy, called c here: For a calibrated system the entropy is maximum and equals 1. Tribus (1969) showed that entropy is a tool for estimating reliability and it is linked with the Bayesian psi test. Entropy is also a proper measure of reliability used in the divergence score described in Weijs et al. (2010).

3) RELIABILITY DIAGRAM
The reliability diagram (Wilks 1995) is a common graphical tool to evaluate and summarize probability forecasts of a binary event. We use the term probability because this tool evaluates a prediction based on a threshold exceedance for a given parameter (e.g., the frost probability). It consists of plotting observed frequencies against predicted probabilities. Subject to calibration, the resulting plot should be close to the first bisecting line. Nevertheless, this tool should be computed with a sufficient number of observations (which is the case in our study) as recalled by Bröcker and Smith (2007).

d. Scoring rules
Following Gneiting et al. (2007), Gneiting and Raftery (2007), and Gneiting and Katzfuss (2014), scoring rules assign numerical scores to probabilistic forecasts and form attractive summary measures of predictive performance, since they address calibration and sharpness simultaneously. These scores are usually taken to be negatively oriented and we wish to minimize them. A proper scoring rule is designed such that the expected value of the score is minimized when the observation is drawn from the same distribution than the predictive distribution.
Following Ferro et al. (2008), if F represents an ensemble forecast with members x 1 , . . . , x K 2 R, a so-called fair estimator of the CRPS (Ferro 2014 We can also define the skill score in term of CRPS between two ensemble prediction systems, in order to compare them directly: The value of the continuous ranked probability skill score (CRPSS) will be positive if and only if system A is better than system B for the CRPS scoring rule.
Some theoretical and analytic formulas for CRPS for several distributions are available in appendix B.

Analysis of the French operational ensemble forecast system (PEARP)
We now compare QRF and EMOS techniques for lead times from 3 up to 54 h for forecasts of surface temperature and wind speed over 87 French stations using observations and the French ensemble forecast system of Météo-France called PEARP (Descamps et al. 2015). Data comprise 4 yr between 1 January 2011 and 31 December 2014 using initializations at 1800 UTC. Verification and results are made over the years 2013 and 2014. The aim of our study is to compare both techniques according to their specificities and advantages: on the one hand the QRF method is nonparametric so it needs a large data sample for learning, which is why we employed a cross-validation method (each month of years 2013 and 2014 are retained as validation data for testing the model, while all 4 years of data without the forecasted month are used for learning). On the other FIG. 6. Mean scores with 95% bootstrap confidence intervals for all locations across lead times for surface temperature. QRF_M is the best technique for CRPS and CRPSS. Calibrated ensembles are unbiased and in general better dispersed than raw ensembles. QRF techniques tend to provide more reliable forecasts than EMOS (the raw ensemble entropy is around 0.75). The raw ensemble is the sharpest, but it is not reliable.

JUNE 2016
T A I L L A R D A T E T A L .
hand, a sliding period of the 40 last days prior the forecast output as in Gneiting et al. (2005), Schuhen et al. (2012), and Thorarinsdottir and Gneiting (2010) give good results for EMOS. But EMOS has to be tuned optimally for a fair comparison, which is why for temperature all the data available for each day (4 yr less the forecast day) with a seasonal dependence like in Hemri et al. (2014) are taken. For wind speed, a sliding period of 1 yr gives the best results for EMOS. For verification, we choose for all methods to form a K-member ensemble from predictive CDFs by taking forecast quantiles at level i/(K 1 1) for i 5 1, . . . , K,

a. Surface temperature
We now give results for surface temperature. We show an example for 36-h lead time (corresponding to 0600 UTC) at two locations which are Lyon and Paris-Orly airports in France. Figures 1 and 2 show RH for all presented methods. For both examples, the raw ensemble is biased and underdispersive whereas EMOS and QRF techniques show graphically good calibration. Table 1 confirms these first results. We can see that the raw ensemble is not reliable and has the worst CRPS. EMOS and QRF techniques are unbiased and dispersion is satisfying. In a general way, the lowest CRPS are for QRF_M. It is very interesting to notice that most of the time all indices of reliability (discrepancy index, quadratic index, max index, and entropy) exhibit the same rankings for the different models. Reliability for EMOS and QRF_O focuses only on the example of Paris-Orly. The discrepancy index shows a better reliability for QRF whereas other indexes penalize this. Thus, it is sometimes interesting to assess calibration with several tools. Now let us focus on all stations for a 36-h lead time. Figure 3 shows RH for the three techniques where a box plot represents the distribution of a rank for all stations. Results are satisfying, all the RHs are unbiased, but we have a ''wavy'' RH for EMOS whereas the RH for QRF techniques seems to be better. Nevertheless we can assume a slightly U-shaped RH for QRF_O and a slightly dome-shaped for QRF_M to be signs of an unperfect dispersion. These first remarks are strengthened by Fig. 4 where we see that the three calibration techniques are unbiased and QRF techniques are a little more reliable than EMOS technique for the discrepancy index (we only show this index of reliability here according to our previous remarks on indices of reliability), but we can assume that results are quite mixed now. The diagnosis of spread ensembles exhibits a slight FIG. 9. Rank histograms for the Paris-Orly airport for the 24-h forecast of surface wind speed. This time, raw ensemble is not biased but still underdispersed.

2384
IQR box plots for calibrated ensembles are taller than the raw ensemble. And last but not least, when we focus on the CRPS skill score computed with regard to QRF_M for each station we see that almost all the values of the different box plots are under 0: not only does QRF_M have a better CRPS in general but QRF_M is better in CRPS than all other ensembles and this is true for almost all stations in this study.
We also investigate performances of probabilistic forecasts of frost for all stations for the 36-h lead time. Figure 5 shows reliability diagrams for all ensembles. We can see very good performances of calibrated ensembles whereas raw ensemble tends to overpredict frost. This is not surprising since in Fig. 4 we see that the raw ensemble is essentially cold biased.
We continue this study on surface temperature by showing results across lead times in Fig. 6. We note that raw ensemble follows a diurnal cycle for all scores. This phenomenon is not shared by calibration techniques concerning reliability but just for CRPS and IQR: we conclude that reliability is not influenced by lead time for calibrated ensembles, only IQR is concerned and thus the CRPS. In addition, the very good entropy of calibrated ensembles (the raw ensemble entropy is around 0.75), which causes us to think that the gain is mainly in reliability. It is interesting to see that raw ensemble does not manage to conciliate good dispersion with small bias. Moreover, reliability of the raw ensemble tends to increase among lead times: indeed predictions are less sharp so they can manage to catch the observation. Besides, we note that calibrated ensembles still remain unbiased and reliable with a preference for QRF techniques concerning entropy and are quite well dispersed. The QRF_O technique is a little bit underdispersed and QRF_M is a little bit overdispersed but both are quite close to 1. Last but not least, we see for CRPS that QRF_O and EMOS are very similar and the gap with QRF_M tends to remain the same across lead times. We can explain the gain in CRPS by the introduction of predictors from other variables than surface temperature and show all the interest of QRF_M method regarding QRF_O. Now let us conclude by showing the interest of QRF techniques and in particular QRF_M technique for forecasters. In our opinion, the main issue of the EMOS technique is that it loses one of the main aims of ensemble forecasting, which is to assess different scenarios from different initial conditions (i.e., to build different trajectories that can converge or diverge in order to create meteorological scenarios). Indeed, EMOS technique fits a single and unimodal PDF and does not permit one to make alternative scenarios. In Fig. 7 Fig. 7 we have a situation with snowy ground and clear skies during the night causing a rapid cooling. Here, even if all calibrated ensembles give a mode around 248 we can see that QRF_M proposes cooler scenarios. The forecaster knowing this phenomenon of rapid cooling would choose this scenario to make a deterministic forecast for example. We can assume here that the combined predictors snowfall amount and surface irradiation permit one to detect a nonlinear phenomenon. For the forecast at Carcassonne, France, we see that the raw ensemble is bimodal. QRF_M technique is able to detect a situation conducting to a bimodality and so it fits a bimodal PDF (and if this bimodality is just an artifact it is an artifact now shared by the raw ensemble and the QRF_M ensemble). Moreover, observation corresponds to the first mode of QRF_M PDF whereas other calibrated ensembles are unimodal. It is the same case for Boulogne-sur-Mer, France: the bimodal raw ensemble leads to bimodal PDFs for QRF techniques (second modes are between 188 and 198) and the first mode is preferred and almost corresponds to observation. EMOS technique here fits the PDF in order to avoid mistakes and put its mean between the two raw ensemble modes. It can happen that meteorological situations detected by QRF_M technique lead to a unimodal PDF whereas the raw ensemble sees two different scenarios. It is the case of the forecast at Paris-Le Bourget airport where QRF_M does not take into account the (misleading) raw ensemble bimodality and fits its mode between these modalities, and is consistent with the observation. Nevertheless, we remember that we cannot evaluate ensemble forecasts on single cases. In Fig. 7, a BMA calibration was also made with the same learning sample than EMOS. If BMA technique permits bimodalities this is not the case here: we think that the deterministic forecast, the control member, and the mean of the raw ensemble are much too close to have bimodalities. BMA should be more convenient with ensembles made of several deterministic forecasts.

b. Surface wind speed
We now give results for surface wind speed. Like for surface temperature we choose to begin with an example for 24-h lead time (corresponding to 1800 UTC) at the same locations. Figures 8 and 9 and Table 2 show RH and scores for all presented methods. Mainly the commentaries are the same as for surface temperature. EMOS tends to be a little underdispersed. Figure 10 showing RH for all stations confirms that there is still a little issue with the first rank for EMOS: this is likely due to a suboptimally chosen distribution type. The square root-truncated normal distribution used here minimizes the average CRPS on whole stations. The form of this distribution may not be optimal for calibrated ensemble forecasting little wind speed. This behavior is similar to the PIT histogram in the middle of Fig. 5 of Scheuerer and Möller (2015). At the same time we can note that QRF_M dispersion is almost perfect. Figure 11 confirms the good dispersion of QRF_M. We also note that calibrated ensembles seem unbiased and QRF techniques provide reliable ensembles. Last but not least, for temperatures the CRPS skill score shows that the QRF_M method is the best in terms of CRPS for almost all locations.
We can look at the performance of probabilistic forecast of threshold 5 m s 21 for all stations and 24-h lead time. Figure 12 reveals an overprediction of threshold exceedances by raw ensemble, and this feature is corrected by calibrated ensembles. It is not shown here but the results for 10 m s 21 are as good as for 5ms 21 . We have examined the threshold of 15 m s 21 , but there are not enough observations and the reliability diagram is too noisy to be meaningful. Figure 13 shows results across lead times for surface wind speed. If conclusions are strictly the same as for surface temperature, we can add here that sharpness and entropy of QRF ensembles are better than EMOS. Last, QRF techniques are very well dispersed and reliable and FIG. 13. Mean scores with 95% bootstrap confidence intervals for all locations across lead times for surface wind speed. QRF_M is the best technique for CRPS and CRPSS. Calibrated ensembles are unbiased and in general better dispersed than raw ensemble (the raw ensemble entropy is around 0.85). QRF techniques tend to provide sharper, more reliable, and better dispersed forecasts than EMOS. thus QRF_O (and QRF_M of course) has much better CRPS than EMOS. We can explain these differences with surface temperature by the fact that finding a good parametric distribution is a bit trickier for wind speed than for temperatures and so EMOS performs less well than QRF techniques in that case.

c. Importance of the QRF predictors
One of the peculiarities of the QRF method is that we can see the most useful predictors for the model by watching the importance of predictors: the importance shows how much the mean-squared error of a whole forest increases when a predictor is randomly permuted. ''Randomly permuted'' means that the values of the given predictor are a random sample (without replacement) of the original values. Indeed, if randomly permuting a predictor does not result in a much larger mean-squared error, it means that this particular predictor is of little importance; whereas important predictors will change the quality of predictions by quite a bit if randomly permuted. Figure 14 shows the importance of QRF_O predictors for the 24-h forecast of the surface wind speed. As expected, the most important predictors are those that give information on the center of the distribution. Next, we have the month (a seasonal information) and the first and the ninth decile. It is interesting to see that information on spread or other moments is quite useless, these predictors about spread and higher moments even have same importance that artificially generated random variables (not shown here). We can explain this by the fact that spread information is already contained in decile predictors (in addition to a value on the variable of interest), and it is easier for the model to split meteorological situations by their extreme quantiles rather than their predictability summarized by a statistic such as standard deviation. It is not shown here, but Fig. 14 also applies to another lead time and the other variable, which is surface temperature (with a slightly higher seasonal importance, however).
For the QRF_M method we focus on surface temperature to show that we can detect a meteorological consistency in the QRF model. Figures 15 and 16 show the importance of two different lead times (33 h for 0300 UTC and 42 h for 1200 UTC). We can assume that both figures have quite the same shape. Indeed, we find that central parameters, deciles, and the month are important. In addition, the predictor TPW850 is also important: there is a clear link between surface temperature and potential wet-bulb temperature. Because most of the other predictors have the same importance as noisy predictors, we focus on FLIR3 and FLVIS3: for the 33-h forecast (day) FLIR3 is higher than FLVIS3 in importance but this is the contrary for the 42-h forecast (night). These differences show that the QRF model takes into account diurnal and nocturnal radiation (in terms of wavelengths). Last but not least, we note that RR3 and SN3 have small importance: these predictors are often zero and thus permuting zeros does not change anything, explaining their small importance. We can understand this phenomenon when we are looking at RR3_q90 and SN3_q90. Higher quantiles are less frequently zero and they have higher importance. Nevertheless we can keep them in the model since we remember that random forests do not choose these predictors during node splitting anyway.

Discussion
Through this article, we see that the QRF techniques and the QRF_M technique, which yields on multivariable predictors, give reliable and sharp ensembles compared to EMOS techniques. Moreover, we have noticed that the improvement is more consequent for a non-Gaussian variable like surface wind speed than for surface temperature. This improvement is quite the same among lead times showing that nonparametric calibration methods do not lose predictive performance compared to EMOS and can improve over this method. We also believe that nonparametric calibration is more useful for forecasters since output PDF is not constrained by the QRF technique. It allows us to keep the notion of the scenario for our calibrated ensembles and it can detect nonlinear phenomena. It is not just a correction of bias and dispersion for a given distribution. This nonparametric method is a data-driven technique. This may be viewed as a drawback, but the advent of big data and reforecast techniques let us think that nonparametric methods will be frequently used in order to calibrate forecast ensembles and more generally for ensemble output statistics in meteorology. The QRF technique is linked to the method of analogs (Hamill and Whitaker 2006;Delle Monache et al. 2013) in the sense that QRF is another way to find the closest observations given a set of predictors. The method of analogs consists in finding the closest past forecasts (the analogs) according to a given metric of the predictors' space to build an analog-based ensemble. The QRF technique proceeds by iterative dichotomies on the predictors' space to find the closest past forecasts. So both methods share many advantages (e.g., no parametric assumption, easily applicable to multipredictor settings, etc.) and drawbacks (large datasets). Moreover, Delle Monache et al. (2013) applied the method of analogs for surface temperature and wind speed on much smaller datasets (and with only three or four predictors) than in Hamill and Whitaker (2006) for rainfall: the size of the dataset is an issue depending on the weather variable under consideration, it will be interesting to check the performances of the analogs technique and QRF with smaller datasets than in Hamill and Whitaker (2006) for rainfall but with many more predictors (we remember that our QRF_M technique uses more than 40 predictors).
In addition, we show that it is always better to have several methods for assessing performance. Moreover, we have presented some alternatives to interpret rank histograms other than in a graphic way, by the use of entropy in particular.
As a perspective we will apply QRF techniques to other parameters (rainfall as said above) and try regional calibration: we could add, for example, predictors like longitude, latitude, and altitude to make regional QRF regroup some stations/grid points in order to have fewer (but bigger) forests and model some spatial interactions. Some works in the same vein have been published recently for EMOS (Feldmann et al. 2015). We will also try techniques for trajectory recovery in ensemble forecasts by using the nonparametric technique of ensemble copula coupling (Bremnes 2007;Krzysztofowicz and Toth 2008;Schefzik et al. 2013). We are also interested in combining bias correction for deterministic forecasts to  Acknowledgments. The authors want to thank the three anonymous reviewers for their helpful advice and remarks on this paper. Part of the work of P. Naveau has been supported by the ANR-DADA, LEFE-INSU-Multirisk, AMERISKA, A2C2, CHAVANA and Extremoscope projects. Part of the work was done when P. Naveau was visiting the IMAGE-NCAR group in Boulder, Colorado.

List of Predictors for QRF_O and QRF_M
See Tables A1 and A2 for the list of predictors.

List of Theoretical Formulas and Analytic Formulas for the CRPS for Several Distributions
The continuous ranked probability score (CRPS; Matheson and Winkler 1976;Hersbach 2000) is defined directly in terms of the predictive CDF, F, as CRPS(F, y) 5 ð '
Another representation  shows that CRPS(F, y) 5 E F jX 2 yj 2 1 2 E F jX 2 X 0 j , where X and X 0 are independent copies of a random variable with distribution F and finite first moment, respectively. Another elegant representation that we found using the L-moments (Hosking 1989) is CRPS(F, y) 5 E F jX 2 yj 2 E F fX[2F(X) 2 1]g.
Here we find some analytic formulas for the CRPS. Some of them are already known and a reference is mentioned (to the best of our knowledge) but the others have been computed. This list permits us to sum up some formulas for further studies.